EXPLORING WINE QUALITY by STEFAN KONECNY

This data set contains information about the quality of Vinho Verde white wines from the Minho region of Portugal. It contains objective measurements of 11 chemical attributes of 4898 different white wines, along with a quality score. The data set is tidy and complete, i.e. there are no missing entries.

The quality score is the median of the scores given by an expert jury of at least three judges. It would be very interesting to also have the scores of the individual judges, so I could see the variance in scores. Unfortunately, only the aggregate value is provided.

Quality takes a numerical value, but it is effectively an ordered factor (and is treated accordingly). According to the dataset description, possible values for quality range from 0 (very bad) to 10 (very excellent). The academic paper [1] referencing the dataset suggests that the experts judged the wine on this numeric scale (as opposed to a qualitative scale later mapped to numbers).

Univariate Plots Section

Quality and Rating are the only two factor variables. The remaining variables are numerical.

According to the dataset description numerical variables can be divided into:

  1. those measured on a normalized scale: pH and alcohol
  2. those measured in weight per volume: the rest
    • some of these values may depend on each other, such as free and total sulfur dioxide

The value range for all variables is quite narrow and similar across all ratings. The distributions don’t have a significant long tail, unlike diamond prices or Facebook friend counts.

Let me start by splitting the alcohol and pH histograms by rating. I will set the scales to ‘free_y’ so the less frequent ratings are not flattened.

The shape of the alcohol histogram is very interesting. It doesn’t look much like a normal distribution. It has multiple peaks.

Interestingly, when I split the histograms by rating, strongly complementary trends appear in lower and higher rated wines. Many bad and medium wines have a lower alcohol content. On the other hand, many good and excellent wines have a high alcohol content.

The histogram of the overall distribution of pH resembles a bell curve closely. It is slightly right skewed, with a bit of a long tail to the right.

Such trends are even more pronounced in most of the other variables. Chlorides are a good example.

There are also a couple of exceptions to these general trends. These are the already discussed alcohol, residual.sugar and sulphates.

Investigating individual variables

Chlorides

This plot suggests that the median for good and excellent wines is no higher than the first quartile for medium and bad wines. This means that good wines are statistically more likely to have a lower value of chlorides than bad ones. The summary confirms this.

## wines_w$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01300 0.03750 0.04600 0.05056 0.05400 0.29000 
## -------------------------------------------------------- 
## wines_w$rating: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03700 0.04400 0.04774 0.05100 0.34600 
## -------------------------------------------------------- 
## wines_w$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## -------------------------------------------------------- 
## wines_w$rating: excelent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03550 0.03801 0.04400 0.12100

So if I discarded all wines with chlorides above 0.03700 (the median for good wines), I would discard the majority of bad and medium wines (roughly three quarters, since 0.037 is about their first quartile) and still keep about half of the good and excellent ones.

If I decide to focus exclusively on excellent wines, I can go even tighter and remove wines with chlorides above 0.0355 (median for excellent wines).

Another observation is that the values of all quartiles and the mean decrease with rating. So there is a strong trend there.
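The per-rating summaries above have the shape produced by base R's `by()`. The following is a minimal sketch of that kind of grouped summary. The data frame `toy` and all of its values are invented for illustration (they are not the wine data), and the level labels are my assumption.

```r
# Toy stand-in for the wine data; values are invented for illustration.
toy <- data.frame(
  chlorides = c(0.050, 0.046, 0.044, 0.041, 0.037, 0.035, 0.033, 0.030),
  rating = factor(
    c("bad", "bad", "medium", "medium", "good", "good", "excellent", "excellent"),
    levels = c("bad", "medium", "good", "excellent"),
    ordered = TRUE
  )
)

# One six-number summary per rating level.
print(by(toy$chlorides, toy$rating, summary))

# The medians alone, as a plain named vector, for comparing groups directly.
group_medians <- tapply(toy$chlorides, toy$rating, median)
medians <- setNames(as.vector(group_medians), names(group_medians))
print(medians)
```

Comparing the median of one group with the quartiles of another, as done in the text, then only needs `quantile()` on the relevant subset.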

Ratio between free.sulfur.dioxide and total.sulfur.dioxide

For free.sulfur.dioxide, the boxplot hints at a trend that would allow me to discard some bad wines.

Interestingly, the ratio of free to total sulfur dioxide reveals an even clearer trend. This is true despite total sulfur dioxide not showing a strong trend on its own.

Once again I validate this insight by inspecting the statistical summary.

## wines_w$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03371 0.10540 0.16130 0.18880 0.23850 0.65680 
## -------------------------------------------------------- 
## wines_w$rating: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02362 0.18810 0.25000 0.25240 0.31120 0.71050 
## -------------------------------------------------------- 
## wines_w$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0500  0.2118  0.2717  0.2757  0.3333  0.6429 
## -------------------------------------------------------- 
## wines_w$rating: excelent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07895 0.22310 0.28770 0.28930 0.33620 0.60380

In this case the median of good wines is above the 3rd quartile of bad wines. This allows me to filter out a large majority of bad wines.

As with chlorides, all quartiles and the mean follow a clear (this time growing) trend with the rating.

Density

I did the same analysis for density.

## wines_w$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9960  1.0000 
## -------------------------------------------------------- 
## wines_w$rating: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9923  0.9944  0.9945  0.9966  1.0390 
## -------------------------------------------------------- 
## wines_w$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## -------------------------------------------------------- 
## wines_w$rating: excelent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010

This time the median for good and excellent wines is below the first quartile for medium and bad ones. So I can again discard some medium and bad wines, as with chlorides.

In this case the quartiles and mean do not follow a clear trend with respect to rating.

Alcohol

## wines_w$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.10   10.17   10.80   13.50 
## -------------------------------------------------------- 
## wines_w$rating: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.00   10.27   11.00   14.00 
## -------------------------------------------------------- 
## wines_w$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## -------------------------------------------------------- 
## wines_w$rating: excelent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.65   12.60   14.00

Alcohol content almost follows a growing trend for quartiles and mean with respect to rating. I decided not to filter out wines with low alcohol content, since alcohol content is typically known to the consumer (unlike the other investigated variables). Hence removing wines with low alcohol volume may reduce choices for consumers with this preference.

Filtering

Based on the above observations I have created a filtered dataset. I kept wines with the following properties:

  1. chlorides < 0.037 (median for good wines)
  2. ratio.sulfur.dioxide > 0.2717 (median for good wines)
  3. density < 0.9918 (median for good wines)
  4. alcohol > 11.00 (3Q for medium) - NOT USED.

The filtered data set contains 374 wines, which represent about 7.6% of the original dataset of 4898 wines.
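The three cut-offs listed above can be applied in a single `subset()` call. This is only a sketch under stated assumptions: the data frame `toy` and its values are invented, the column names follow the variable names used in the text, and the author's actual filtering code is not shown in this report.

```r
# Toy stand-in for the wine data; column names follow the text, values are invented.
toy <- data.frame(
  chlorides = c(0.030, 0.045, 0.036, 0.033, 0.050),
  free.sulfur.dioxide = c(40, 20, 30, 45, 10),
  total.sulfur.dioxide = c(120, 150, 130, 110, 100),
  density = c(0.9905, 0.9950, 0.9930, 0.9910, 0.9960)
)
toy$ratio.sulfur.dioxide <- toy$free.sulfur.dioxide / toy$total.sulfur.dioxide

# Keep only rows that pass all three cut-offs (medians for good wines).
filtered <- subset(
  toy,
  chlorides < 0.037 & ratio.sulfur.dioxide > 0.2717 & density < 0.9918
)

# Share of the data that survives the filter, in percent.
kept_pct <- 100 * nrow(filtered) / nrow(toy)
print(filtered)
print(kept_pct)
```

Combining the conditions with `&` means a wine must pass every criterion, which is why the filtered set shrinks so quickly as criteria are added.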

More importantly, the distribution of the rating changed significantly, as is apparent from this grid plot. The dashed shape in the background shows the barplot with all the wines (before filtering).

Distribution of wines before filtering

## 
##           3           4           5           6           7           8 
## 0.004083299 0.033278889 0.297468354 0.448754594 0.179665169 0.035728869 
##           9 
## 0.001020825

Distribution of wines after filtering

## 
##           4           5           6           7           8           9 
## 0.005347594 0.053475936 0.403743316 0.443850267 0.085561497 0.008021390

The original distribution is in the top left bar plot and the filtered one is in the bottom left one. The ratio of < 6 wines (short for wines with quality below 6) is much smaller. Wines with the lowest quality 3 were completely removed. The most frequent quality is now 7 (44.4%), whereas originally it was 6 (with 44.9%).

More than 98.5% of < 6 wines were removed. 93% of 6’s were removed, in contrast to only 81% / 82% of 7’s / 8’s and only 40% of 9’s.

Therefore there is a higher ratio of wines above 6 (good and excellent, 53.75%) than below 7 (medium and bad, 46.25%).

Because there is now a similar number of 6’s and 7’s left, it will be easier to spot differences and commonalities between the two groups.
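The removal percentages quoted above can be re-derived from numbers printed in this report: the per-quality counts reported by `table(wines_w$quality)` further down, and the after-filtering proportions scaled to the 374 remaining wines. A small sketch (the variable names are mine):

```r
# Counts per quality score before filtering, as reported by table(wines_w$quality).
before <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
            `7` = 880, `8` = 175, `9` = 5)

# Proportions after filtering, as printed above (quality 3 no longer occurs),
# scaled back to counts using the 374 remaining wines.
after_prop <- c(`3` = 0, `4` = 0.005347594, `5` = 0.053475936, `6` = 0.403743316,
                `7` = 0.443850267, `8` = 0.085561497, `9` = 0.008021390)
after <- round(after_prop * 374)

# Percentage of each quality score removed by the filter.
removed_pct <- 100 * (1 - after / before)
print(round(removed_pct, 1))

# Qualities 3 to 5 pooled ("< 6" in the text).
low <- c("3", "4", "5")
removed_low_pct <- 100 * (1 - sum(after[low]) / sum(before[low]))
print(round(removed_low_pct, 1))
```

Working from counts rather than proportions makes it clear that the share of 6’s drops much more sharply than the share of 7’s and 8’s.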

Impact of filtering

Next I re-examined the variables, this time considering only the filtered wines. I was curious whether new trends had emerged. This is indeed the case. The most pronounced trend can be observed for residual sugar.

Residual sugar

## filtered_wines_w$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.20    1.55    1.90    1.90    2.25    2.60 
## -------------------------------------------------------- 
## filtered_wines_w$rating: medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   1.100   1.600   2.241   2.600  10.800 
## -------------------------------------------------------- 
## filtered_wines_w$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.600   2.400   3.038   4.212   9.700 
## -------------------------------------------------------- 
## filtered_wines_w$rating: excelent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   1.700   4.200   3.671   4.975   8.300

Once again I could remove the majority (at least 75%) of the bad and medium wines while keeping more than half of the excellent ones and at least a quarter of the good ones. To do this, I would keep only wines with residual sugar > 2.6 (3Q for medium wines, blue dotted line).

In the first round of filtering the aim was to remove the bulk of bad and medium wines. Now I have more control. I could also use residual sugar to separate good and excellent wines by keeping only wines with residual sugar above 4.212 (3Q good, green dotted line).

Options for further filtering

Besides residual sugar, I could use the value of total.sulfur.dioxide > 113.8 to remove the majority of bad and medium wines.

Or use the value of citric.acid > 0.3375 to separate good and excellent wines (after I removed the medium wines through residual sugar or total sulfur dioxide value).

Alcohol remains a promising candidate for removing most bad and medium wines, while keeping at least half of the excellent ones.

In summary I have many more options for additional filtering:

  1. residual.sugar > 2.6 (3Q medium) or > 4.212 (3Q good)
  2. total.sulfur.dioxide > 113.8 (3Q bad > 3Q medium)
  3. alcohol > 12.4 (3Q medium)
  4. citric.acid > 0.3375 (3Q good)

The trends in the filtered wines appear much clearer. They allow me not only to filter out medium wines but even to shift the proportion between good and excellent ones.

However, I was not able to completely remove bad wines with the simple pruning described above. If I applied all the filtering steps simultaneously, there would be only 9 wines left: 5 bad, 2 medium and 2 excellent. But if I skipped the citric acid step, I would have 5 bad, 11 medium and 6 excellent. The latter distribution seems much more promising.

Next I would like to investigate how much filtering reduces the variety of wines. I will do that by drawing bi- and multivariate plots.

Univariate Analysis

What is the structure of your dataset?

There is one dependent variable, quality, describing the quality of the wine. The remaining variables are results of various chemical measurements. Most of these variables seem to follow a roughly normal distribution; alcohol, with its multiple peaks, is a notable exception.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is clearly the quality of the wine. This variable is effectively a ranking ranging from 0 (very bad) to 10 (very excellent). This dataset contains only wines with quality between 3 and 9. However, there are only very few wines of quality 3 and 9, and most wines have quality 6, 5 or 7.

table(wines_w$quality)
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

The interesting question is how quality is determined by the other variables.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I have identified a number of variables (and their critical values) which allow me to distinguish wines >= 7 (quality 7 is good, 8 and 9 is excellent) from medium (5 and 6) and bad wines (below 5).

  1. chlorides < 0.037 (median for good wines)
  2. ratio.sulfur.dioxide > 0.2717 (median for good wines)
  3. density < 0.9918 (median for good wines)
  4. alcohol > 11.00 (3Q for medium) - NOT USED.

I found the variables and values by inspecting the boxplots and looking at statistical summaries. By removing wines not meeting those characteristics (except alcohol) I have obtained a much smaller dataset of 374 wines (about 7.6% of the total). In the filtered dataset the majority of wines are good or excellent (44.39% and 9.36%). These proportions were much smaller in the original dataset (17.97% and 3.67%). This shows that the majority of bad and medium wines were removed.

Repeating the analysis for filtered wines, I identified more options for filtering:

  1. residual.sugar > 2.6 (3Q medium) or > 4.212 (3Q good)
  2. total.sulfur.dioxide > 113.8 (3Q bad > 3Q medium)
  3. alcohol > 12.4 (3Q medium)
  4. citric.acid > 0.3375 (3Q good)

Experimenting with these additional steps I ended up with only 22 wines: 5 bad, 11 medium and 6 excellent.

Did you create any new variables from existing variables in the dataset?

Yes. I created an ordered factor, rating, which is a simplified version of quality that I have used for plotting. The possible values are bad (quality 3,4), medium (quality 5,6), good (quality 7) and excellent (quality 8,9).

I introduced ratio.sulfur.dioxide, which is a numeric variable whose value is free.sulfur.dioxide / total.sulfur.dioxide.

I also introduced an auxiliary variable filtering to draw before/after boxplots.
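One way to derive these two variables in base R is sketched below. The author's own code is not shown in this report, so the use of `cut()`, the break points, the level labels and the data frame `toy` (with invented values) are all assumptions made for illustration.

```r
# Toy stand-in for the wine data; values are invented for illustration.
toy <- data.frame(
  quality = c(3, 4, 5, 6, 7, 8, 9),
  free.sulfur.dioxide = c(10, 20, 30, 40, 35, 45, 30),
  total.sulfur.dioxide = c(100, 160, 150, 140, 120, 130, 100)
)

# Ordered factor: bad (3, 4), medium (5, 6), good (7), excellent (8, 9).
# cut() uses right-closed intervals by default: (2,4], (4,6], (6,7], (7,9].
toy$rating <- cut(
  toy$quality,
  breaks = c(2, 4, 6, 7, 9),
  labels = c("bad", "medium", "good", "excellent"),
  ordered_result = TRUE
)

# Share of the total sulfur dioxide that is free.
toy$ratio.sulfur.dioxide <- toy$free.sulfur.dioxide / toy$total.sulfur.dioxide

print(toy)
```

Making `rating` an ordered factor keeps the levels in the order bad < medium < good < excellent in summaries and plots.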

Bivariate Plots Section

I have plotted the various variables (e.g. sulphates) against quality in scatterplots. To remove outliers I have plotted only values which fell into the interval [median - 2*IQR, median + 2*IQR] (roughly the range covered by the whiskers of a box plot). I also plotted some statistical summaries among the data points: mean (red dot) and 1st (yellow), 2nd / median (orange) and 3rd quartile (brown). The data points were coloured according to their rating.
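The outlier interval described above can be expressed as a small helper. This is a sketch only: the function name `within_two_iqr` and the example values are hypothetical, and the author's plotting code is not shown in this report.

```r
# TRUE for values inside [median - 2 * IQR, median + 2 * IQR], FALSE otherwise.
within_two_iqr <- function(x) {
  centre <- median(x, na.rm = TRUE)
  spread <- IQR(x, na.rm = TRUE)
  !is.na(x) & x >= centre - 2 * spread & x <= centre + 2 * spread
}

# Invented example values; 2.00 is an obvious outlier.
sulphates <- c(0.40, 0.45, 0.47, 0.50, 0.52, 0.55, 0.60, 2.00)
keep <- within_two_iqr(sulphates)
print(sulphates[keep])
```

Note that a standard box plot draws its whiskers from the quartiles (Q1 - 1.5*IQR and Q3 + 1.5*IQR), so the interval used here is similar to the whisker range but not identical to it.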

In general, scatterplots revealed more structure but mostly confirmed observations from boxplots. Take sulphates, for example.

The majority of wines are of quality 5 to 7. There appears to be no clear pattern linking sulphates to the quality (and therefore rating). This is the case for most variables, apart from a few exceptions.

The exceptions are the same I have identified from boxplots:

  1. chlorides < 0.037 (median for good wines)
  2. ratio.sulfur.dioxide > 0.2717 (median for good wines)
  3. density < 0.9918 (median for good wines)
  4. alcohol > 11.00 (3Q for medium) - NOT USED.

There is one more interesting case: volatile acidity. In hindsight the trend was apparent in the boxplot/summary as well. But when I looked at the boxplot I was focused on discriminating bad and medium wines from good and excellent ones (as opposed to removing bad ones).

In any case, adding filtering by volatile acidity would not improve the filtering. Actually it would make things much worse. Removing wines through filtering is a very delicate trial and error procedure. One has to be careful not to remove too many wines.

I have also calculated Pearson’s R between quality and the other variables. Interestingly, the four variables with the strongest correlation (in absolute terms) are the same ones I found through the box plot investigation:

  1. alcohol
  2. density
  3. chlorides
  4. ratio.sulfur.dioxide
##              alcohol              density            chlorides 
##            0.4355747           -0.3071233           -0.2099344 
## ratio.sulfur.dioxide     volatile.acidity total.sulfur.dioxide 
##            0.1972141           -0.1947230           -0.1747372

Even the +/- sign of the correlation agrees with the direction of filtering (above or below a given value). This might be a coincidence. Except for alcohol (and perhaps density), the correlations are quite weak.

On the other hand, the aim of filtering is to remove low quality wines, which dominate the distribution. Therefore even variables with a relatively low correlation with quality could be good candidates for filtering.
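A ranking like the one above can be computed with `cor()` and sorted by absolute value. The sketch below uses an invented data frame `toy` (not the wine data) and my own variable names, so the resulting numbers illustrate only the mechanics.

```r
# Toy stand-in for the wine data; values are invented for illustration.
toy <- data.frame(
  quality   = c(5, 6, 6, 7, 5, 8, 6, 7),
  alcohol   = c(9.5, 10.2, 10.8, 11.9, 9.2, 12.6, 10.1, 11.4),
  density   = c(0.9960, 0.9945, 0.9938, 0.9915, 0.9970, 0.9905, 0.9950, 0.9920),
  chlorides = c(0.050, 0.044, 0.047, 0.036, 0.041, 0.033, 0.039, 0.045)
)

# Pearson correlation of every other column with quality.
predictors <- setdiff(names(toy), "quality")
r <- sapply(toy[predictors], function(x) cor(x, toy$quality, method = "pearson"))

# Rank by absolute value, since strong negative correlations matter as much.
r_sorted <- r[order(abs(r), decreasing = TRUE)]
print(r_sorted)
```

Sorting on `abs(r)` rather than `r` is what puts density and chlorides next to alcohol despite their negative sign.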

I also looked at the correlation between these variables:

##                      volatile.acidity   chlorides total.sulfur.dioxide
## volatile.acidity           1.00000000  0.07051157           0.08926050
## chlorides                  0.07051157  1.00000000           0.19891030
## total.sulfur.dioxide       0.08926050  0.19891030           1.00000000
## density                    0.02711385  0.25721132           0.52988132
## alcohol                    0.06771794 -0.36018871          -0.44889210
## ratio.sulfur.dioxide      -0.19616085 -0.03321768          -0.01344785
##                          density     alcohol ratio.sulfur.dioxide
## volatile.acidity      0.02711385  0.06771794          -0.19616085
## chlorides             0.25721132 -0.36018871          -0.03321768
## total.sulfur.dioxide  0.52988132 -0.44889210          -0.01344785
## density               1.00000000 -0.78013762          -0.06552475
## alcohol              -0.78013762  1.00000000           0.06446642
## ratio.sulfur.dioxide -0.06552475  0.06446642           1.00000000

Alcohol and density show a very strong (negative) correlation. Density is moderately correlated with total.sulfur.dioxide. The correlation between alcohol and total.sulfur.dioxide is a bit weaker.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I have plotted variables against quality and mostly confirmed the insight from box plots.

What I found quite interesting is that the correlation coefficients (with wine quality) quite clearly pointed to the same variables I identified via box plot analysis. This makes sense for variables showing strong correlation. But even variables with weak correlation (below 0.2) can be used for filtering.

The correlation coefficients allowed me to judge the strength of correlations. Alcohol has the strongest correlation with quality (0.44). Density comes next with -0.31.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Alcohol and density show a very strong correlation (-0.78). Density is moderately correlated with total.sulfur.dioxide (0.53). The correlation between alcohol and total.sulfur.dioxide is a bit weaker (-0.45).

What was the strongest relationship you found?

The negative correlation between alcohol and density is clearly the strongest one.

Multivariate Plots Section

I started with plotting the variable pairings identified in the previous section. I have removed the outliers (lying outside of the IQR) and scaled the axes to focus on the majority of data points. I have also customized the alpha value, decreasing alpha for medium (forming the overwhelming majority) and increasing the alpha for bad and excellent wines (which are rare).

The plots confirm trends I already know, i.e. high alcohol and low density wines tend to have higher quality. However, bad quality wines are scattered all over the plot.

The plots are still very noisy and dominated by medium wines (green). Next I decided to focus only on filtered wines. I will repeat the same analysis as before. First I calculate Pearson’s R between quality and other variables, this time considering only the filtered wines.

##              alcohol       residual.sugar total.sulfur.dioxide 
##           0.26456153           0.25614075           0.20561321 
##                   pH  free.sulfur.dioxide            sulphates 
##           0.13566092           0.12016675           0.10712705 
##     volatile.acidity ratio.sulfur.dioxide        fixed.acidity 
##           0.10312750          -0.08281238          -0.04397290 
##          citric.acid            chlorides              density 
##           0.04242991          -0.02023482          -0.01545611

The resulting list is very different from the one for all wines described previously.

##              alcohol              density            chlorides 
##            0.4355747           -0.3071233           -0.2099344 
## ratio.sulfur.dioxide     volatile.acidity total.sulfur.dioxide 
##            0.1972141           -0.1947230           -0.1747372

There are very few similarities between the two lists. In both, alcohol shows the strongest correlation with quality. For filtered wines the difference between alcohol and the second best variable (residual sugar) is very small, in contrast to the other list.

Coincidentally, both lists have three entries with correlation above 0.2 (in absolute terms). But for filtered wines the correlation drops much more quickly. I start by plotting the top three variables for filtered wines (alcohol, residual.sugar, total.sulfur.dioxide) against each other.

The plot of residual.sugar against total.sulfur.dioxide reveals an area dominated by good and excellent wines (where residual sugar is between 3 and 9). Next I will zoom into this section of the plot. I will also visualize two additional variables through size and shape.

In the plot below, size corresponds to alcohol and shape to pH. I have tried different combinations of variables, but I couldn’t find a combination which reliably discriminates between good, medium and excellent wines. On the contrary, I have established that medium and excellent wines can be very similar (examples plotted against a red background).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Once again I have mostly confirmed trends from my previous investigation.

It was very interesting to see how the importance of features has changed, when I have considered only the filtered wines. In hindsight, it was to be expected. When I removed many wines based on a particular variable, one could expect that this variable will become less important in the remaining wines.

Yet I found the extent of this surprising. Density and chlorides are the features least correlated with quality in the filtered dataset. Prior to filtering, they were the most correlated features (after alcohol).

The plot below nicely shows the difference between the two data sets. The plot for all wines shows a linear tendency, while the plot for filtered wines is more scattered.

It is remarkable how strongly alcohol correlates with quality. I would be curious to know whether this is because poor quality wines have little alcohol, or because the high alcohol content numbs the finer distinctions between the wines.

Were there any interesting or surprising interactions between features?

I was curious whether I could find a simple criterion to reliably tell apart medium, good and excellent wines. I couldn’t find one. It was really interesting to see how similar two wines which differ significantly in quality can be.

The authors of [1] applied a number of machine learning techniques and came to a similar conclusion:

In general, the white [wine] data results are better: 60.3/63.3% for classes 6 and 4, 67.8/72.6% for grades 7 and 5, and a surprising 85.5% for the class 8 (the exception are the 3 and 9 extremes with 0%, not shown in the table).

Note that grade 6 wines represent 44.88% of distribution and are detected in 60.3% of cases. Grade 5 wines represent 29.75% and are detected in 72.6%. Grade 7 wines represent 17.97% and are detected in 67.8%.

In summary, for 92.6% of wines the detection accuracy is roughly 2/3 (about 66% when weighted by class share). This suggests that reliably separating wines by quality is not easy.
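The summary figure can be checked from the class shares and per-class accuracies quoted above. The sketch below weights each accuracy by the share of its grade, which is one reasonable reading of the quoted numbers; the variable names are mine.

```r
# Class shares (% of all white wines) and per-class accuracy (%) quoted above.
share    <- c(`5` = 29.75, `6` = 44.88, `7` = 17.97)
accuracy <- c(`5` = 72.6,  `6` = 60.3,  `7` = 67.8)

# Combined share of the three most common grades.
total_share <- sum(share)

# Accuracy averaged over those grades, weighted by class share.
weighted_accuracy <- weighted.mean(accuracy, w = share)

print(total_share)
print(round(weighted_accuracy, 1))
```

Because grade 6 is both the largest class and the one detected least reliably, it pulls the weighted figure towards the lower end of the three accuracies.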

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I chose to create no models. The main reason is that quality (the target variable) is ultimately a factor variable, albeit expressed on a numeric scale. It is a subjective judgement of at least three jurors on a limited scale.

Based on my investigation I am sceptical about finding an objective and universal relationship between the measured variables and assigned quality. The findings reported in [1] reinforce my scepticism.


Final Plots and Summary

Plot One

Description One

I have drawn scatter plots of various variables (e.g. sulphates) against quality to visualize how the distribution of each variable varies with quality. The dots are coloured according to the rating of the wines, which is a variable derived from quality that I use throughout this analysis.

To remove outliers I have plotted only values which fell into the interval [median - 2*IQR, median + 2*IQR] (roughly the range covered by the whiskers of a box plot). I also plotted some statistical summaries among the data points: mean (red dot) and 1st (yellow), 2nd / median (orange) and 3rd quartile (brown).

Both the scatter plots and the statistical summaries suggest that I can draw a line separating the vast majority of bad and medium wines from the rest. The dotted blue cut-off line represents this decision boundary. Conversely, the wines on the other side of the line are much more likely to be good or excellent.

I use such cut-off values for chlorides and density (below the line) and ratio.sulfur.dioxide (above the line) to filter out wines which are bad and medium (and not good or excellent). This way I get a filtered wines dataset. I should also mention that I created ratio.sulfur.dioxide during the analysis (I divided free.sulfur.dioxide by total.sulfur.dioxide).

I do not consider alcohol content for filtering. People can consciously choose wines by alcohol content, which is unlikely for many other chemical properties measured and discussed here (e.g. chlorides).

Plot Two

Description Two

I chose the plot above to highlight the difference between the original and filtered dataset (described above). I focused on the residual.sugar variable. For the filtered dataset, residual.sugar has the second strongest correlation with wine quality by Pearson’s R (just below alcohol and the difference between them is very small).

The plot shows that filtering has greatly decreased the range of possible values. For residual.sugar in particular, I could easily reiterate the filtering by removing the vast majority of bad and medium wines (below blue dotted line) or focus only on keeping the excellent wines (above the green dashed line).

Apart from removing subpar wines (below good), filtering can also accentuate the differences between the ratings.

Plot Three

Description Three

This plot shows two areas dominated by good and excellent wines (highlighted by the yellow background). I have only plotted wines from the filtered dataset, which contains a larger proportion of good and excellent wines (above 53%, compared with below 22% for all wines). The filtered dataset is also much smaller and contains only 374 wines (about 7.6% of the total). Thus considering only filtered wines greatly increases the readability of the plot.

I have plotted total.sulfur.dioxide against residual.sugar. These two variables show the highest correlation (by Pearson’s R) with quality for filtered wines, with the exception of alcohol. Alcohol itself is represented by the size of the dots. Shape represents the level of pH.

The high number of features in this plot is intentional. It is meant to illustrate that it is hard to reliably discriminate even between medium and excellent wines. Cases of very similar wines with very different ratings are highlighted by red backgrounds. I have tried different combinations of variables, but I couldn’t find a combination which reliably discriminates between good, medium and excellent wines.

Despite the removal of the vast majority of wines, the highlighted wines show a good variety in alcohol, pH and residual.sugar.


Reflection

I liked this project very much! I have not followed the stream of consciousness/plots strictly. I did so many plots that it would be boring to keep them all in. I have just described which plots I have done and kept those which have shown something interesting.

I think I have overdone this project by going much deeper than a quick exploration. But I was curious how far you can go using purely visualization. Relying purely on machine learning (as described in [1]) would be a sensible option. But then I wouldn’t have inspected the structure of the data so closely.

Because of the structure of the project I first tried to get the best out of univariate plots before moving on. Perhaps I spent too much time on box plots and filtering because of this. Now I would move more freely between different visualizations as the need arises.

On the other hand, it was nice that I rediscovered the same trends through multivariate analysis using Pearson’s R. Pearson’s R was a real delight. In this particular case it has proven a very useful heuristic for identifying promising variables. It also allowed me to compare their relative promise.

Of course one has to be careful and always check and verify. Pearson’s R is not guaranteed to give useful, meaningful insights, but then neither are most other statistical methods (except in the rare cases when it can be shown that their formal requirements hold). But with messy real world data, Pearson’s R has proven to be a surprisingly useful rule of thumb for identifying promising variables for trial and error.

References

[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.